valid generalization
What Size Net Gives Valid Generalization?
We address the question of when a network can be expected to generalize from m random training examples chosen from some ar(cid:173) bitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size. We show that if m O( log) random exam(cid:173) ples can be loaded on a feedforward network of linear threshold functions with N nodes and W weights, so that at least a fraction 1 - t of the examples are correctly classified, then one has confi(cid:173) dence approaching certainty that the network will correctly classify a fraction 1 - of future test examples drawn from the same dis(cid:173) tribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than O( '!') random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a 1 - fraction of the future test examples.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
This paper shows that if a large neural network is used for a pattern classification problem, and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. More specifi(cid:173) cally, consider an i-layer feed-forward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A. The misclassification probability con(cid:173) verges to an error estimate (that is closely related to squared error on the training set) at rate O((cA)l(l 1)/2J(log n)jm) ignoring log factors, where m is the number of training patterns, n is the input dimension, and c is a constant. This may explain the gen(cid:173) eralization performance of neural networks, particularly when the number of training examples is considerably smaller than the num(cid:173) ber of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks Generalization and the Size of the Weights in Neural Networks 135 that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networks perform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of { -1, I}.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks Generalization and the Size of the Weights in Neural Networks 135 that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networks perform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of { -1, I}.
For Valid Generalization the Size of the Weights is More Important than the Size of the Network
Baum and Haussler [4] used these results to give sample size bounds for multi-layer threshold networks Generalization and the Size ofthe Weights in Neural Networks 135 that grow at least as quickly as the number of weights (see also [7]). However, for pattern classification applications the VC-bounds seem loose; neural networks often perform successfully with training sets that are considerably smaller than the number of weights. This paper shows that for classification problems on which neural networksperform well, if the weights are not too big, the size of the weights determines the generalization performance. In contrast with the function classes and algorithms considered in the VC-theory, neural networks used for binary classification problems have real-valued outputs, and learning algorithms typically attempt to minimize the squared error of the network output over a training set. As well as encouraging the correct classification, this tends to push the output away from zero and towards the target values of { -1, I}.
What Size Net Gives Valid Generalization?
Baum, Eric B., Haussler, David
We address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size.
What Size Net Gives Valid Generalization?
Baum, Eric B., Haussler, David
We address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size.
What Size Net Gives Valid Generalization?
Baum, Eric B., Haussler, David
We address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probabilitydistribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size.
What size net gives valid generalization?
We address the question of when a network can be expected to generalize from m random training examples chosen from some arbitrary probability distribution, assuming that future test examples are drawn from the same distribution. Among our results are the following bounds on appropriate sample vs. network size. We show that if m O(W/ log N/) random examples can be loaded on a feedforward network of linear threshold functions with N nodes and W weights, so that at least a fraction 1 /2 of the examples are correctly classified, then one has confidence approaching certainty that the network will correctly classify a fraction 1 of future test examples drawn from the same distribution. Conversely, for fully-connected feedforward nets with one hidden layer, any learning algorithm using fewer than Ω(W/) random training examples will, for some distributions of examples consistent with an appropriate weight choice, fail at least some fixed fraction of the time to find a weight choice that will correctly classify more than a 1 fraction of the future test examples.